FinTech Fraud Detection — This notebook focuses on the Credit Card Fraud dataset
Step 1: Exploratory Data Analysis (EDA)
In this step, we will:
- Understand the dataset
- Check for missing values
- Visualize distributions
- Identify early indicators of fraudulent transactions
In [2]:
# Load libraries
!pip install plotly
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("coolwarm")
Requirement already satisfied: plotly in ./myenv/lib/python3.12/site-packages (6.3.1)
Requirement already satisfied: narwhals>=1.15.1 in ./myenv/lib/python3.12/site-packages (from plotly) (2.8.0)
Requirement already satisfied: packaging in ./myenv/lib/python3.12/site-packages (from plotly) (25.0)
In [3]:
# Load the Dataset - Adjust path if necessary
Credit = pd.read_csv("/mnt/c/1.MorganeCanada/Project-2-/Data/CreditCard_FraudDetection.csv")
Credit.head()
Out[3]:
| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
5 rows × 31 columns
In [4]:
# Basic Overview - We’ll inspect shape, column types, missing values, and a few summary statistics.
print("Shape:", Credit.shape)
print("\nInfo:")
print(Credit.info())
print("\nMissing values:", Credit.isnull().sum().sum())
Credit.describe().T.head(10)
Shape: (284807, 31)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
None

Missing values: 0
Out[4]:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Time | 284807.0 | 9.481386e+04 | 47488.145955 | 0.000000 | 54201.500000 | 84692.000000 | 139320.500000 | 172792.000000 |
| V1 | 284807.0 | 1.759088e-12 | 1.958696 | -56.407510 | -0.920373 | 0.018109 | 1.315642 | 2.454930 |
| V2 | 284807.0 | -8.251210e-13 | 1.651309 | -72.715728 | -0.598550 | 0.065486 | 0.803724 | 22.057729 |
| V3 | 284807.0 | -9.655224e-13 | 1.516255 | -48.325589 | -0.890365 | 0.179846 | 1.027196 | 9.382558 |
| V4 | 284807.0 | 8.321417e-13 | 1.415869 | -5.683171 | -0.848640 | -0.019847 | 0.743341 | 16.875344 |
| V5 | 284807.0 | 1.650335e-13 | 1.380247 | -113.743307 | -0.691597 | -0.054336 | 0.611926 | 34.801666 |
| V6 | 284807.0 | 4.248462e-13 | 1.332271 | -26.160506 | -0.768296 | -0.274187 | 0.398565 | 73.301626 |
| V7 | 284807.0 | -3.054652e-13 | 1.237094 | -43.557242 | -0.554076 | 0.040103 | 0.570436 | 120.589494 |
| V8 | 284807.0 | 8.777941e-14 | 1.194353 | -73.216718 | -0.208630 | 0.022358 | 0.327346 | 20.007208 |
| V9 | 284807.0 | -1.179734e-12 | 1.098632 | -13.434066 | -0.643098 | -0.051429 | 0.597139 | 15.594995 |
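The summary statistics show that Time and Amount sit on very different scales from the PCA components V1–V28, and Amount has a long right tail. As a sketch of the scaling flagged for later steps, a median/IQR ("robust") scaler is less distorted by that tail than a plain z-score; the snippet below demonstrates the idea on a small synthetic stand-in for the two columns (not the real Credit frame):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for Credit[['Time', 'Amount']]; the exponential draw
# mimics the right-skew of transaction amounts.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "Time": rng.uniform(0, 172_792, size=1_000),
    "Amount": rng.exponential(scale=88.0, size=1_000),
})

def robust_scale(s: pd.Series) -> pd.Series:
    # Center on the median and divide by the interquartile range,
    # so extreme outliers barely influence the scaling parameters.
    iqr = s.quantile(0.75) - s.quantile(0.25)
    return (s - s.median()) / iqr

scaled = sample.apply(robust_scale)
print(scaled.median().round(6))  # medians land at ~0 after scaling
```

The same two lines applied to the real frame would put Time and Amount on a footing comparable to the already-standardized V columns.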
In [5]:
# Target Variable Distribution - The dataset is highly imbalanced, which will influence our modelling strategy later.
fig = px.histogram(Credit, x='Class', color='Class',
color_discrete_map={0: "skyblue", 1: "red"},
title="Fraud (1) vs Non-Fraud (0)",
text_auto=True)
# Show percentage in annotation (optional)
fraud_ratio = Credit['Class'].value_counts(normalize=True)[1] * 100
fig.update_layout(
annotations=[dict(
x=0.5,
y=1.05,
xref='paper',
yref='paper',
text=f"Fraudulent transactions represent only {fraud_ratio:.3f}% of total data",
showarrow=False,
font=dict(size=14))])
fig.show(renderer="notebook_connected")
In [6]:
# Transaction Amount Distribution
plt.figure(figsize=(8,5))
sns.histplot(Credit['Amount'], bins=100, kde=True)
plt.title("Distribution of Transaction Amounts")
plt.xlabel("Transaction Amount")
plt.show()
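Because Amount is so heavily right-skewed, a linear-scale histogram compresses most transactions into the first few bins. A common remedy is plotting log1p(Amount) instead; log1p (log of 1 + x) keeps zero-amount transactions finite. A minimal sketch on synthetic amounts (standing in for Credit['Amount']):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed amounts; the real column behaves similarly.
rng = np.random.default_rng(1)
amounts = pd.Series(rng.exponential(scale=88.0, size=5_000))

log_amounts = np.log1p(amounts)
# Skewness drops sharply after the transform, making the histogram readable.
print(f"raw skew: {amounts.skew():.2f}, log1p skew: {log_amounts.skew():.2f}")
```

Swapping `Credit['Amount']` for `np.log1p(Credit['Amount'])` in the histplot call above gives the more informative view.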
In [7]:
# Temporal Analysis - The Time variable represents seconds elapsed since the first transaction. We create an Hour feature to see if fraud clusters occur at specific times.
Credit['Hour'] = ((Credit['Time'] // 3600) % 24).astype(int)
# Create an interactive histogram
fig = px.histogram(
Credit,
x='Hour',
color='Class',
barmode='group', # side-by-side bars
color_discrete_map={0: "skyblue", 1: "red"},
title="Fraud Frequency by Hour of Day",
labels={'Class': 'Transaction Class', 'Hour': 'Hour of Day'},
text_auto=True)
fig.update_layout(
xaxis=dict(dtick=1), # show every hour tick
yaxis_title="Count",
legend_title="Class")
fig.show(renderer="notebook_connected")
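Raw hourly counts are dominated by overall transaction volume, so a quiet hour with few frauds can still be riskier per transaction than a busy one. A per-hour fraud *rate* (the mean of the 0/1 Class label within each hour) controls for volume; sketched below on a small synthetic frame with the same columns the notebook derives:

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame mirroring the derived Hour column and binary Class;
# the real Credit frame works identically.
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "Hour": rng.integers(0, 24, size=10_000),
    "Class": rng.binomial(1, 0.002, size=10_000),
})

# The mean of a 0/1 label per group is exactly the per-hour fraud rate.
fraud_rate = demo.groupby("Hour")["Class"].mean()
print(fraud_rate.round(4).head())
```

On the real data, `Credit.groupby('Hour')['Class'].mean()` plotted as a bar chart makes the risky hours stand out regardless of how busy they are.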
In [8]:
# Correlation Analysis - how Amount correlates with fraud (Class) within each hour of the day
hourly_fraud_corr = Credit.groupby('Hour').apply(lambda x: x['Amount'].corr(x['Class'])).reset_index(name='Correlation')
# Create a color list: red if correlation > 0, blue if < 0
colors = ['#FF6B6B' if val > 0 else '#4D96FF' for val in hourly_fraud_corr['Correlation']]
plt.figure(figsize=(10,5))
sns.barplot(x='Hour', y='Correlation', data=hourly_fraud_corr, palette=colors)
plt.title("Correlation of Amount with Fraud by Hour")
plt.ylabel("Correlation with Fraud (Class)")
plt.xlabel("Hour of Day")
plt.show()
In [9]:
# Interactive Visualization (Plotly)
import plotly.io as pio
pio.renderers.default = "notebook_connected"  # or "notebook" / "jupyterlab"
fig = px.histogram(Credit, x="Amount", color="Class", nbins=60,
                   barmode="overlay", title="Transaction Amount by Fraud Status",
                   color_discrete_map={0: 'blue', 1: 'red'})  # Class is int64, so keys must be ints
fig.show()
Key Insights
- The dataset is highly imbalanced: fraudulent transactions make up only ~0.17% of all records.
- Fraud is relatively more frequent at certain hours of the day (see the hourly plots above).
- Feature scaling and class balancing will be essential for accurate modeling.
Next steps
- Data cleaning
- Feature engineering
- First ML models
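As a sketch of what class balancing means in practice, one common option is class weighting rather than resampling. "Balanced" weights in the sklearn sense are n_samples / (n_classes * count_c); with this dataset's class counts (284,315 non-fraud vs 492 fraud), the minority class ends up weighted several hundred times more heavily:

```python
import numpy as np

# 'Balanced' class weights as sklearn computes them:
#   w_c = n_samples / (n_classes * count_c)
counts = np.array([284_315, 492])   # non-fraud / fraud counts from this dataset
n_samples, n_classes = counts.sum(), len(counts)
weights = n_samples / (n_classes * counts)
print(weights.round(2))  # majority weight ~0.5, minority weight ~289
```

Passing `class_weight='balanced'` to an estimator applies exactly these weights, so each fraud example contributes as much to the loss as roughly 578 non-fraud examples.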
In [12]:
# Feature engineering + check: derive hour, a night flag (22:00-06:59), and a
# log-amount, then inspect the enriched frame. These derivations are
# reconstructed here so the cell runs standalone; they reproduce the
# columns and counts shown in the output below.
credit = Credit.drop(columns='Hour')
credit['hour'] = (credit['Time'] // 3600) % 24
credit['is_night'] = ((credit['hour'] >= 22) | (credit['hour'] <= 6)).astype(int)
credit['amount_log'] = np.log1p(credit['Amount'])
print("CREDIT CARD DATA")
print("Shape:", credit.shape)
print(credit.head())
print("\nColumns:", credit.columns)
print("\n" + "="*60 + "\n")
print("Hour distribution:")
print(credit['hour'].value_counts().sort_index())
print("\nNight transactions count:")
print(credit['is_night'].value_counts())
CREDIT CARD DATA
Shape: (284807, 34)
Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 ... V24 V25 V26 V27 V28 \
0 0.098698 0.363787 ... 0.066928 0.128539 -0.189115 0.133558 -0.021053
1 0.085102 -0.255425 ... -0.339846 0.167170 0.125895 -0.008983 0.014724
2 0.247676 -1.514654 ... -0.689281 -0.327642 -0.139097 -0.055353 -0.059752
3 0.377436 -1.387024 ... -1.175575 0.647376 -0.221929 0.062723 0.061458
4 -0.270533 0.817739 ... 0.141267 -0.206010 0.502292 0.219422 0.215153
Amount Class hour is_night amount_log
0 149.62 0 0.0 1 5.014760
1 2.69 0 0.0 1 1.305626
2 378.66 0 0.0 1 5.939276
3 123.50 0 0.0 1 4.824306
4 69.99 0 0.0 1 4.262539
[5 rows x 34 columns]
Columns: Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
'Class', 'hour', 'is_night', 'amount_log'],
dtype='object')
============================================================
Hour distribution:
hour
0.0 7695
1.0 4220
2.0 3328
3.0 3492
4.0 2209
5.0 2990
6.0 4101
7.0 7243
8.0 10276
9.0 15838
10.0 16598
11.0 16856
12.0 15420
13.0 15365
14.0 16570
15.0 16461
16.0 16453
17.0 16166
18.0 17039
19.0 15649
20.0 16756
21.0 17703
22.0 15441
23.0 10938
Name: count, dtype: int64
Night transactions count:
is_night
0 230393
1 54414
Name: count, dtype: int64
Are there more frauds during the night?
In [10]:
# Total amount of frauds.
fraud_df = credit[credit["Class"] == 1]
print("Total fraudulent transactions:", len(fraud_df))
# At night?
fraud_night_counts = fraud_df["is_night"].value_counts()
print("Fraud count by night/day:")
print(fraud_night_counts)
fraud_night_percent = fraud_df["is_night"].value_counts(normalize=True) * 100
print("\nFraud percentage by night/day:")
print(fraud_night_percent)
# Particular Hour?
fraud_by_hour = fraud_df.groupby("hour").size()
print(fraud_by_hour)
Total fraudulent transactions: 492

Fraud count by night/day:
is_night
0    329
1    163
Name: count, dtype: int64

Fraud percentage by night/day:
is_night
0    66.869919
1    33.130081
Name: proportion, dtype: float64

hour
0.0      6
1.0     10
2.0     57
3.0     17
4.0     23
5.0     11
6.0      9
7.0     23
8.0      9
9.0     16
10.0     8
11.0    53
12.0    17
13.0    17
14.0    23
15.0    26
16.0    22
17.0    29
18.0    33
19.0    19
20.0    18
21.0    16
22.0     9
23.0    21
dtype: int64
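The raw split (329 day frauds vs 163 night frauds) is misleading on its own, because far fewer transactions happen at night. Dividing each fraud count by the corresponding transaction totals printed earlier gives comparable *rates*, using only numbers from the cell outputs above:

```python
# Fraud counts and total transaction counts by is_night, taken from the
# outputs above (night = 22:00-06:59 under the notebook's split).
fraud = {"day": 329, "night": 163}
total = {"day": 230_393, "night": 54_414}

rates = {k: fraud[k] / total[k] for k in fraud}
for k, r in rates.items():
    print(f"{k}: {r:.4%}")
# Night rate (~0.30%) is roughly double the day rate (~0.14%).
```

So yes: although most frauds happen in daytime in absolute terms, a night transaction is about twice as likely to be fraudulent.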